首页> 外文OA文献 >Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization
【2h】

Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization

机译:家庭住户转型的有效实现   用于加速QR分解的算法 - 体系结构协同设计

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

We present efficient realization of Householder Transform (HT) based QRfactorization through algorithm-architecture co-design where we achieveperformance improvement of 3-90x in-terms of Gflops/watt over state-of-the-artmulticore, General Purpose Graphics Processing Units (GPGPUs), FieldProgrammable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical andexperimental analysis of classical HT is performed for opportunities to exhibithigher degree of parallelism where parallelism is quantified as a number ofparallel operations per level in the Directed Acyclic Graph (DAG) of thetransform. Based on theoretical analysis of classical HT, an opportunityre-arrange computations in the classical HT is identified that results inModified HT (MHT) where it is shown that MHT exhibits 1.33x times higherparallelism than classical HT. Experiments in off-the-shelf multicore andGeneral Purpose Graphics Processing Units (GPGPUs) for HT and MHT suggest thatMHT is capable of achieving slightly better or equal performance compared toclassical HT based QR factorization realizations in the optimized softwarepackages for Dense Linear Algebra (DLA). We implement MHT on a customizedplatform for Dense Linear Algebra (DLA) and show that MHT achieves 1.3x betterperformance than native implementation of classical HT on the same accelerator.For custom realization of HT and MHT based QR factorization, we also identifymacro operations in the DAGs of HT and MHT that are realized on aReconfigurable Data-path (RDP). We also observe that due to re-arrangement inthe computations in MHT, custom realization of MHT is capable of achieving 12%better performance improvement over multicore and GPGPUs than the performanceimprovement reported by General Matrix Multiplication (GEMM) over highly tunedDLA software packages for multicore and GPGPUs which is counter-intuitive.
机译:我们提出了通过算法-架构协同设计有效实现基于Householder Transform(HT)的QR分解的方法,与最先进的多核通用图形处理单元(GPGPU)相比,我们实现了3-90倍Gflops / watt的性能提升),现场可编程门阵列(FPGA)和ClearSpeed CSX700。进行经典HT的理论和实验分析是为了表现出更高的并行度,其中并行度被量化为变换的有向无环图(DAG)中每级的并行操作数量。基于对经典HT的理论分析,确定了经典HT中的机会重排计算,从而导致修改后的HT(MHT),其中表明MHT的并行度比经典HT高1.33倍。在针对HT和MHT的现成的多核和通用图形处理单元(GPGPU)中进行的实验表明,与Dense Linear Algebra(DLA)的优化软件包中基于经典HT的QR因式分解实现相比,MHT能够实现稍好或相等的性能。我们在密集线性代数(DLA)的定制平台上实现了MHT,并证明了MHT的性能比在同一加速器上经典HT的本地实现要好1.3倍。对于HT和基于MHT的QR因式分解的自定义实现,我们还确定了DAG中的宏操作HT和MHT在可重配置数据路径(RDP)上实现。我们还观察到,由于MHT中计算的重新安排,MHT的自定义实现比通用矩阵乘法(GEMM)所报告的性能经过高度调整的DLA软件包对多核和GPGPU的性能提高了12%。 GPGPU违反直觉。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
代理获取

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号